0. General introduction

General comments:

This tutorial is the written-up version by a lecture given by Dr. Anja Spang and Dr. Nina Dombrowski as part of a bioinformatics workshop in 2019 given for graduate students at the Royal Netherlands Institute for Sea Research (NIOZ) in the Netherlands.

The goal of this tutorial is to learn how to run a phylogeny using single as well as concatenated marker genes.

We have three proteins that are found in ~300 archaeal genomes and we want to create a phylogenetic tree for all three single proteins as well as generate a concatenated proteint tree.

This notebook will consist of two parts:

  1. A theoretical part going into the various aspects that are useful to know for phylogenetic analyses.
  2. A practical part. Here, we provide a set of protein sequences for which we want to generate trees.

The tutorial works on the NIOZ server named ada, where most tools are installed. The exception are custom scripts, which will be provided as part of this tutorial. If you work on a different system you need to set up programs, such as mafft, yourself and change file paths if needed.

1. Theory

What is a phylogenetic tree and how do we read it?

Sometimes trees are also represented as a cladogram in which branch lengths do not correspond to the amount of character change. Below we see a phylogram next to a cladogram.

Some basic terminology

  1. Monophyly

The yellow dots depict synapomoprhies: –> shared derived characters that are present in all members of a phylogenetic group and its ancestor.

  1. Polyphyly

  1. Convergent versus parallel evolution
  • Divergent evolution: Groups from the same common ancestor evolve and accumulate differences, resulting in the formation of new species. Divergent evolution may occur as a response to changes in abiotic factors, such as a change in environmental conditions, or when a new niche becomes available.
  • Convergent evolution: Organisms not closely related (not monophyletic), independently evolve similar traits as a result of having to adapt to similar environments or ecological niches. Convergent evolution occurs when descendants resemble each other more than their ancestors did with respect to some feature. Parallel evolution implies that two or more lineages have changed in similar ways, so that the evolved descendants are as similar to each other as their ancestors were.

  1. Rooted versus unrooted trees

A rooted tree is a tree in which one of the nodes is defined as the root, and thus the direction of ancestral relationships is determined. In contrast, an unrooted tree has no pre-determined root and therefore shows no hierarchy and we do not know in which direction evolution is happening. Therefore, in this case, the distance between the nodes should be symmetric (since the tree edges are not directed). Rooting an unrooted tree involves inserting a new node, which will function as the root node.

The most widely known way to root is outgroup rooting. In this case, we need to use external knowledge (or make a hopefully reasonable assumption) to identify at least one species which we know to be outside of the rest of the study group but not so far away that character homology becomes difficult to establish.

An alternative way to root is midpoint rooting. In this case, the longest distance between two terminals on the tree is identified and the root is then placed precisely in the middle of that distance. The assumption behind midpoint rooting is that character changes across the phylogenetic tree are approximately clock-like, that is they happen approximately at the same speed in every lineage. Sometimes this is a reasonable assumption, as when one has many neutral characters and species with approximately similar characteristics. But there are many situations in which evolution is far from clock-like: if characters are under selection, if for some reason lineages evolve at very different speeds, or if there are appreciable amounts of missing data in the dataset, midpoint rooting will be mislead.

How do we select good markers?

To select markers we need to make sure that sequences are orthologous and have a shared ancestry.

  • Orthologs are genes that in different species evolved from a common ancestral gene.
  • Paralogs are gene copies created by a duplication event within the same genome.
  • While orthologous genes kept the same function, paralogous genes often develop different functions due to missing selective pressure on one copy of the duplicated gene.

Additionally, selecting a good marker depends on the scientific question? Do we want to:

  • Approximate a species tree? Then we can use ribosomal RNA (i.e. the 16S rRNA gene) or universal, conserved and single-copy marker genes or proteins.
  • study the evolution of enzyme families? THen we can look at genes or protein sequences belonging to a specific family, such as the ATP synthase.

One important thing to remember is that a RNA/gene or protein does not necessarily equal the species tree. For example, differences between the two can be due to horizontal gene transfers, imcomplete lineage sorting, gene duplications or gene losses.

Alignments

When we align sequences we want to “sort” homologous sequences so that the form columns. There are several algorithms to do this:

  • Progressive alignments: Calculate pairwaise distances between all sequences. Then compute a fast tree, usually using neighbour joining methods. And then build a progressive alignment using the branching order of the tree. This means that two species that are closest in the tree are aligned first. Then the sequences are treated as one pair of sequences (fixing gaps) and aligned to the next closest sequence. This is repeated until all sequences are aligned. Programs that use this approach are ClustalW and ClustalX.

The benefit of this approach is that it is fast and can be used for a large number of sequences. However, if a gap is introduced early in the alignment (even though there should be none) then this error is carried throughout the full alignment and can create quite some issues.

If we have introduced alignment errors we can manually clean this.

  • Iterative alignments: This approach gradually optimizes alignments. This alignment is based on a progressive alignment combined with a scoring system and refinement steps. Due to the refinement steps this approach is slower but much more accurate. Programs that use this approach are mafft and muscle.

Personal recommendation: If you have less than 1000 sequences use mafft_linsi if you have more use mafft.

Trimming and visualizing alignments

As we have mentioned above, alignments are not perfect. They can have gappy regions, regions that are poorly conserved (i.e. the ends) or regions that are hard to align.

Therefore, we use trimming methods to remove such reasons and improve our alignments.

Some examples of available tools are: - TrimAL - BMGE

Due to the issues with alignments it is always recommended to visually inspect your alignment. Tools to do this are: - seaview - jalview - …

Reconstructing trees

We will not go into detail but there are different methods to construct a tree:

Neighbour joining:

  • Constructs a tree by sequentially finding pairs of the nearest neighbours based on a distance matrix. It corrects for multiple substitutions, i.e. mutational saturation.
  • Uses a clustering algorithm for tree reconstruction and then optimizes length of internal branches
  • This approach is very fast but looses information (since every sequences is only presented by one state)
  • Uses: To generate a guide tree, such as the ones needed or aligning sequences.

Maximum parsimony

  • Choose the simplest explanation that fits the evidence
  • very sensitive to long branch attrachtion
  • can not correct for multiple substitutions –> underestimates true divergence
  • Can be used for small datasets, when we have slow rates of evolution or look at closely related species.

Long-branch attraction (LBA) is the grouping of highly divergent taxa as sister taxa if they are not in fact sisters. This grouping is usually based on the parallel accumulation of large amount of substitutions. I.e. convergent changes along branches are interpreted as similar due to common decent. LBA often occur in parsimony based methods but also occur in methods that rely on models of evolution, especially if these are too simple and do not well describe evolutionary processes.

Maximun likelihood (ML)

  • ML algorithms search for the tree that maximises the probability of observing character stats (i.e. the aligned positions) given a tree topology and a model of evolution
  • involved numerical optimisation techniques that find the combination of branch length and evolutionary parameters that maximises the likelihood.
  • This approach is computationally demanding and the resources needed dependent on the used model of evolution.

Commonly used tools:

  • FastTree: very fast, suitable for a first look or large datasets, rough approximation
  • RaxML: slower, possible to use more complex models of evolution
  • Iq-tree: slower, possible to use more complex models of evolution, extremely well documented and constantly updated

Bayesian trees

  • Character-state methods that use an optimally criterion
  • In contrast to MP and ML does not try to find the best tree.
  • Searches for a set of plausible trees for the data, i.e. so that the posterior distribution holds a confidence estimate of any evol. relationship
  • need to specifiy a prior believe, i.e. prior distrbution on model parameters, branch length and tree topology.
  • This approach is very slow

Commonly used tools:

  • MrBayes
  • PhyloBayes

Models of sequence evolution

For DNA

There exist many substitution models. I.e. the Jukes Cantor model

Substitution matrix : A two dimensional matrix with scores that describes the probability of one amino acid (or nucleotide) being replaced by another.

Other models account for different base frequences, such as the Felsenstein model.

We can also take into account that there are different frequencies of transversions versus transitions, such as in the Kimura 2 parameter model.

A more complex model is the general time reversible model (GTR) that works quite well for most data based on DNA sequences.

For proteins

Many options, i.e. in IQ-TREE:

And we can also optimize with these parameters in IQ-TREE with these settings:

Rate heterogenity:

Nucleotide substitution rates vary for different positions in the sequence, i.e. the third position in codons often mutates faster.

Additionally, the genetic code is degenerate (i.e. redundant) and transitions are less likely to change amino acids.

So how can we account for rate heterogeneity?

  • We need to model rate distributions over sites
  • i.e. we can use gamma ditributions

Some options to account for this in IQ-TREE are:

Protein mixture models

Standard protein substitution models use a single amino acid replacement matrix but site evolution is highly heterogenous. Therefore, a single relpacement matrix is often not enough to represent all the complexity of evol. processes. To overcome this we can combine several amino acid replacement matrices to better fit protein evolution.

Options in IQ-TREE are:

For all models more complex models usually generate a tree that will better fit the data, i.e. the tree will have a higher likelihood. But a more complex model will have more free parameters to estimate and thus might have a greater error (i.e. variance). Therefore, the simplest model that is good enough to model your data is the best model.

Choosing the best models

What model we choose depends on our data but we have to keep in mind that we can generate erroneous trees if our model is too simple to describe our data. However, if we chose models with to many parameters (i.e our model is too complex) then we can also get a tree with a wrong topology.

To just give one example:

We can determine the best model using IQ-TREE:

  • with the standard model selection (-m TEST option) or the new ModelFinder (-m MFP) option
  • these models automatically select the best-fit model for our phylogenetic analyses

Confidence: Can I trust my trees?

Why do we need bootstrapping?

Various methods allow to assess the confidence in branching patterns or branch supports.

Methods that are implemented in IQ-TREE:

  • bootstrapping
  • ultrafasr bootstrap approximation
  • SH-like apporximate likelihood tests

For Bayesian trees, i.e. PhyloBayes, we can use:

  • posterior predictive (PP) tests

Below we see an example that explains the rationale behind bootstrapping

2. Practical part

The general workflows for phylogenetic analyses (generally) is as follows:

  1. Identify your marker genes/proteins
  2. Align your marker genes
  3. Trim alignments
  4. When working with more than one gene/protein AND when these have the same evolutionary history: concatenate alignents
  5. Run phylogenetic analyses
  6. Visualize results

In this tutorial we will go all these three steps.

For this workflows it is generally recommended to put files for different parts of the workflow into different subfolders. Here, our structure will look as follows:

Accessing your files

Let’s first start with how to get to your files. For this we will assume that you will have a folder with the name Tutorial in your Desktop and the scripts and protein sequences in their respective folders (see the structure in the example above).

From now on, whenever there is a grey box this is an example for code we can use.

The structure for using commands usually is

the command we want to run followed by on what file we run it

Notice, you need to run this in the folder you downloaded the example files into.

#change your working director (cd = change directory)
#cd <path_to_your_working_dir>

#list all files in your directory (using ls, the list command, the * is a wildcard and lists everything)
ll *

#if we want to list a specific file type, i.e txt files
ll *txt

#if we want to check how our file looks we can view the first 10 lines with head
head Input_files/1_Protein_Seqs/TIGR00283.faa

#we make new directories with the mkdir command
mkdir backup

#if you want to copy files to the backup folder, we can use copy (cp) command
#here we first type the command (cp), then what we want to copy (our faa file) and then were we want to copy (to the backup folder)
cp Input_files/1_Protein_Seqs/TIGR00283.faa backup

#if we want to copy all three files, we can use the wildcard again
cp Input_files/1_Protein_Seqs/*.faa backup

Basic commands for bash and accessing the terminal can be checked in the General notebook.

Identify your marker genes/proteins

Since identifying genes/proteins usually takes a bit longer, below is just an example on how you for example can run a search. As an alternative, you can also search for sequences of interest in databases, i.e. NCBI.

Below we run the tool hmmsearch to search for Protein profiles in our genomes of interest.

hmmsearch Genomes.faa ProteinProfiles.hmm -E 1e-5 > Output.txt

A general hint for dealing with larger files:

  • Have a useful header that describes what is part of your file
  • Be short, concise and do not use extra symbols (DO NOT use space, try to limit yourself to - or _)
  • If you want to concatentate files, they need to have a common part in the sequence header. I.e. if your genome (Genome1) has all our three proteins then this genome name should be part of the sequence header.

In our example we have a very simple header, for each protein sequence we just have the genome ID in the header of genomes that have one of our 3 proteins. Usually that is a bit too simplistic (since we now do not know what protein in our genome encoded for our proteins of interest) but makes the workflow a bit easier for us.

The three proteins we have are:

  • TIGR00283: peptidyl-tRNA hydrolase, average of 114 amino acids
  • TIGR01008: ribosomal protein uS3, average of 195 amino acids
  • TIGR03670: DNA-directed RNA polymerase subunit B, average of 600 amino acids

If we would look at the sequences with a sequence viewer we would see something like this:

Align marker genes

We have two options to do this:

  1. We run this gene for gene (for now we just put it into our working directory, but if you are more familiar with the command line best put this in their own subdirectories)
mafft-linsi --reorder --thread 4  1_Protein_seqs/TIGR03670.faa > TIGR03670.aln

This command we then can run 3x for all our proteins.

Here:

  • mafft_linsi is our tool we use for aligning sequences
  • –reorder is an option that reorders our sequences based on how similar they are
  • –thread defines with how many cpus we run our analysis
  • > redirects the output of a command to a file
  1. If we want to run all three files at once we can run a loop
#first we create a list for our three files we want to loop through
ls Input_files/1_Protein_seqs/*faa | sed 's/\.faa//g' > FileList

#then we can use the FileList to build our loop
for i in `cat FileList`; do mafft-linsi --reorder --thread 4 ${i}.faa > 
${i}.aln; done

Loops always have the structure used above and you see that the actual mafft command more or less stays the same. The variable i stands for the 3 protein files that are listed in FileList

After we have done this our file should look something like this:

Looking at the highlighted areas we can already see that there are some potentially problematic regions:

  • the ends usually do not look too good because they are more difficult to align
  • often we have a few sequences with insertions

Due to these issues it often makes sense to trim the alignment.

Trimming

For now, we just run the program 3x for each of our three proteins. If you feel comfortable, try this in a loop.

java -jar /opt/biolinux/BMGE-1.12/BMGE.jar -i TIGR03670.aln -t AA -g 0.2 -m BLOSUM30 -h 0.55 -of TIGR03670_trimmed.aln

The options here are:

  • -t the input file, which contains our trimmed sequence alignment
  • -g maximum gap rate (range 1-0; 0.2 default)
  • -m BLOSUM matrix to use (default 62)
  • -h Maximum entropy threshold (range 1-0; default 0.5)
  • -of the name of the output file and output format

This would look like this in a loop:

for i in `cat FileList`; do java -jar /opt/biolinux/BMGE-1.12/BMGE.jar -i ${i}.aln -t AA -g 0.2 -m BLOSUM30 -h 0.55 -of ${i}_trimmed.aln; done

When this runs we should see something like this:

Here, bmge reports how long the sequences initially were and how many characters were removed.

If we check our alignment after trimming, we expect to see something like this:

The higlighted areas are the regions we before labelled as “problematic”. We can see that they have been removed.

Depending on your settings, i.e. having a very short protein, BMGE sometimes can be very stringent. Here, you can always play with the settings and control the gap penalties or try TrimAL as an alternative.

Generate a phylogenetic tree

As discussed above there are several tools to use. We will work with iq-tree, which as some good documentation here:

http://www.iqtree.org/doc/

As well as slides from an IQ-TREE workshop:

http://www.iqtree.org/workshop/

Depending on your job and the capacity of the server/computer using: Always check what resources you have as some trees need to have ~20 CPUs to run well. You do not want to clog the system and stop other users from running their job.

Examples for running iqtree with model selection

These are just somes examples and you can play with them on your own.

The basic command to run an alignment and test for different models is:

#Run an extended model selection that additionally includes FreeRate model and  run tree with the best model found
iqtree -s alignment.aln -m MFP

This will test all models known to iqtree, if you want to speed this up you can choose a certain set of models.

#Run an extended model selection that additionally includes FreeRate model and  run tree with the best model found
iqtree -s alignment.aln -m MFP -mset WAG,LG,JTT

By default, substitution models are not included in these tests. If we want to test them we have to add them. Generally, it is recommended to include them in the test and the following selection would be quite comprehensive for testing models.

#Run an extended model selection that additionally includes FreeRate model and  run tree with the best model found
iqtree -s alignment.aln -m MFP -madd LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60,LG+C10+R+F,LG+C20+R+F,LG+C30+R+F,LG+C40+R+F,LG+C50+R+F,LG+C60+R+F

Run a tree on our protein files

For now, lets try to test one of our simpler models, which has the benefit that it runs quicker.

iqtree -s TIGR03670_trimmed.aln -m JTT -nt 4 -bb 1000 -pre TIGR03670_JTT

The options are:

  • -s Input alignment that can be provided in PHYLIP/FASTA/NEXUS/CLUSTAL/MSF format
  • -m model to use (here: JTT)
  • -nt number of threads (use AUTO to determine best fit)
  • -bb Ultrafast bootraping (using 1000 replicates)
  • -pre Prefix for all outputfiles (useful if we need to rerun trees to i.e. test different models)

Again, we can also run this in a loop like this:

for i in `cat FileList`; do iqtree -s ${i}_trimmed.aln -m JTT -nt 4 -bb 1000 -pre treefiles/${i}_JTT; done

For practice, there are several combinations you can try, such as:

  1. JTT vs LG
  2. LG vs LG+F+R
  3. LG+F+R versus LG+C10+F+R
  4. run a model test

Notice: the more complex the model, the more cpus you would need. Setting -nt auto automatically chooses the best amount of CPUs. When using this make sure what your system allows.

What tree files look like

There are two general outputfiles from phylogenetic programs.

  1. Nexus files

These look like this (theory)

These look like this (text format)

You can see in the example above, that a the syntax of nexus files uses brackets, colons and dots. Therefore, it is recommended to avoid these symbols for your genome names/in your fasta headers.

  1. Nexus files

These files have a slightly different syntax:

 #NEXUS
 Begin data;
 Dimensions ntax=4 nchar=15;
 Format datatype=dna missing=? gap=-;
 Matrix
 Species1   {{DNA sequence|atgctagctagctcg}}
 Species2   {{DNA sequence|atgcta??tag-tag}}
 Species3   {{DNA sequence|atgttagctag-tgg}}
 Species4   {{DNA sequence|atgttagctag-tag}}           
 ;
 End;

What file format is generated depends a bit on the tool you use but there are scripts that can be found online to convert between the two file formats. For now we will work with newick files.

Renaming files

When opening our tree file, we see that the genome have a rather non-informative header. Often, having an easy and short header makes scripts easier but is not useful if we want to visualize the tree. However, if we provide a mapping file, we can replace the names with something more useful.

A mapping file should consist of two columns that are tab-separated. The first column should be the name exactly as it appears in our tree file, the second column has the name we want to replace it with. I.e. like this

The mapping file is provided in the first folder as well as the script we will use to search and then replace our names as follows:


perl Replace_tree_names.pl names_to_replace TIGR01008_JTT.treefile > TIGR01008_JTT.treefile_renamed

And a loop would look like this:


for i in *treefile; do perl ../Replace_tree_names.pl ../names_to_replace ${i} > ${i}_renamed; done

Visualizing tree files

There are several tools for this. One example is the web-based tool iTOL but for this example we have a look at the software figtree, which is free to use.

To open our file in figtree we can use either the GUI interface or type

figtree TIGR03670_LG.treefile_renamed

In Figtree we have several options to change the tree such as:

  • Change the color by: File –> Annotations –> load Annotations.txt (open to see the structure of such a file)
  • Read in the boostrap values by going to the extending the left hand side panel for branch lables –> select display and here select the label
  • Root the tree by extending the Trees panel and selecting Root tree (uses midpoint rooting). If we know an outgroup we can manually root
  • Order our tree, i.e. beautify b extending the trees panel and ordering the nodes.

Then we should see something like this

With Annotations.txt we can for example color the different phyla different and thus very quickly can see, whether our tree looks ok or not.

Concatenating multiple proteins

If we want to run species trees it is recommended to concatenate several genes/proteins since this way we can increase the amount of information have and thus better resolve deeper branches.

If you want to do this, the following requirements need to be fullfilled:

  • The sequences must have the same name
  • The sequences must have the same length (this is why we will concatenate after the alignment and trimming step)
  • This approach is only suitable if your marker gene occurs only 1x in your genome (beware of paralogs and contaminations)
  • Your proteins have the same evol. history, i.e. for ribosomal proteins

In theory if we would concatenate two proteins, we would do something like this:

Now in practice we can concatenate two (or more) sequences using a script like:

perl catfasta2phyml.pl -f -c TIGR01008_trimmed.aln TIGR03670_trimmed.aln > Concat_trimmed.aln

Then we can run the tree and rename the tree as we have done before

#run tree
iqtree -s Concat_trimmed.aln -m LG+G+F -nt 8 -bb 1000 -pre Concat_LG_GF

#clean up genome names
perl Replace_tree_names.pl names_to_replace Concat_LG_GF.treefile > Concat_LG_GF.treefile_renamed